How good are global Newton methods. Part 1
A.  A.  Goldstein* 

ABSTRACT: 1) Relying on a theorem of Nemerovsky and Yuden (1979), a lower bound
is given for the efficiency of global Newton methods over the class $C^1(\mu, \Lambda)$ defined below.
2) The efficiency of Smale's global Newton method in a simple setting with a non-singular,
Lipschitz-continuous Jacobian is considered. The efficiency is characterized by two parameters,
the condition number $Q$ and the smoothness $S$, defined below. The efficiency is
sensitive to $S$ and insensitive to $Q$.

KEYWORDS: Global Newton methods, unconstrained optimization, computational complexity

Global Newton methods are considered by some to be methods for minimizing a "strongly"
convex function $f$ defined on a real Hilbert space $E$. Strongly convex means that $f$ is twice
differentiable with a Hessian that is bounded from above and below. By $C(\mu, \Lambda)$ we denote
the set of all strongly convex functions whose Hessian is bounded below by $\mu$ and above
by $\Lambda$. The Hessian is invertible, so Newton's method is well defined at every point of
$E$. Moreover, a strongly convex function achieves a minimum, where $\nabla f(x) = 0$. However,
Newton's method may not converge to a root of $\nabla f(x) = 0$ from arbitrary points in $E$.
This is a raison d'etre for the global Newton methods. These methods, whose ingredients
contain Newton steps, generate sequences that converge for every strongly convex function
and any starting point in $E$. The convergence rate is asymptotically superlinear. An
early history of this subject may be found in Polak (1973), who cites contributions of
Goldstein (1965), Pshenichnyi (1970), and Robinson (1972). More recent work is due to
Bertsekas (1982), Dunn (1980), Hughes and Dunn (1984), and others. All of these results
give estimated asymptotic rates of convergence. Global Newton methods for finding roots
go back to at least 1934. They are related to continuation methods. An early history
and discussion may be found in Ortega and Rheinboldt (1970, p. 235), who credit the basic
idea to Lahaye (1934, 1948). Current references may be found in Smale (1986-2). In general
we regard a global Newton method as any algorithm incorporating Newton steps
that generates a finite sequence terminating in an approximate root. This is a point from
which the ordinary Newton's method will converge. Other algorithms are available that
terminate in an approximate root. The efficiency, or iteration count, of two such algorithms
will be compared to a global Newton method. The word "algorithm" as used in this paper
should be taken with "a grain of salt". We assume information that is not given with real
problems. Our excuse for doing this is that we hope thereby to gain insight and motivation
for the future construction of good algorithms.

The efficiency of a global Newton method was probably first analyzed by Kung (1976),
using natural assumptions that imply a non-vanishing Jacobian. It appears that the next
such result is due to Smale (1986-2), who established a global Newton method in the general
setting of an analytic mapping between Banach spaces, both real or both complex. We
revisit this problem below. Our assumptions are close to Kung's but our algorithm follows
Smale. The first part of this paper will show that the class of strongly convex functions,
and thus any more general class that includes the strongly convex functions, is not a
suitable setting for Newton's method; hence it is also not suitable for global Newton methods.
Unfortunately, this is the setting for the asymptotic convergence proofs mentioned above.

Consider the class $C^1(\mu, \Lambda)$ having the following definition. Let $F$ be a continuously
differentiable map from a separable real Hilbert space $H$ into itself. The inner product in $H$
will be denoted by $[\,\cdot\,,\,\cdot\,]$. Let $D(x)$ denote the Frechet derivative of $F$ at $x$. By $C^1(\mu, \Lambda)$
we denote the set of all maps $F$ for which $\mu\|h\| \le \|D(x)h\| \le \Lambda\|h\|$ for all $h, x \in H$,
with $\mu > 0$. Let $Q = \Lambda/\mu$. Assume that the linear operator $D(x)$ has an inverse. We
shall show that no global Newton method (or any other algorithm) can do better than
linear convergence at a certain determined rate over every member of the above class. Any
algorithm that can achieve this rate is called an optimal algorithm. For the special case
of $C(\mu, \Lambda)$ a simple algorithm due to Nesterov (1983) is optimal to within a multiplicative
constant. The convergence rate is linear. Nesterov's algorithm does not require inversions;
it is similar to the gradient method. Any application of Newton's method requires the
computation of an inverse operator or the solution of a system of linear equations. If the
dimension of $H$ is small we usually are willing to pay the price of solving equations to gain
the possibility of quadratic convergence. The convergence estimates for the global Newton
method in the class $C^1(\mu, \Lambda, L)$, the subset of $C^1(\mu, \Lambda)$ for which $D(x)$ satisfies a uniform
Lipschitz condition with constant $L$, show arbitrarily slow convergence for sufficiently large values of $L$.
For the special case when $D(x)$ is everywhere self-adjoint we exhibit a gradient algorithm
whose efficiency is insensitive to $L$. For this case the estimate for the gradient method is
superior to that of the global Newton method when $L$ is sufficiently large. However, the
gradient method is sensitive to $Q$, while the global Newton method is not. Thus for fixed
$L$ and large enough values of $Q$ the situation is reversed.

It  is  a  pleasure  to  thank  Brad  Bell  for  discussions  and  helpful  criticisms. 
REMARK 1. a) If $F \in C^1(\mu, \Lambda)$ then any stationary point of $\|F(x)\|^2$ is a root of $F$.

PROOF Let $f(x) = [F(x), F(x)]$. The differential is $f'(x, h) = 2[F(x), D(x)h]$, where $D(x)h
= F'(x, h)$. Let $h = D^{-1}(x)F(x)$. If $x$ is stationary then $f'(x, h) = 0 = 2[F(x), F(x)]$.
Whence $F(x) = 0$.

b) EXISTENCE. In view of a) above, if in addition $f(x) = \|F(x)\|^2$ has compact level sets then $F$ has
roots.

Let $C(\mu, \Lambda)$ denote the set of twice differentiable convex functions with

$$\mu\|h\|^2 \le f''(x, h, h) \le \Lambda\|h\|^2$$

for all $x$ and $h \in H$, and some positive $\mu \le \Lambda$. The class $C(\mu, \Lambda)$ is called a set of "strongly
convex" functions. The number $Q = \Lambda/\mu$ is called the condition number.

ALGORITHMS By an algorithm $A(g)$, where $g \in C(\mu, \Lambda)$, we mean a recurrence relation
that calculates $x_{k+1}$ using some of the values of $g$, $g'$ and $g''$ at $x_s$, $s = 0, 1, 2, \ldots, k$, with $x_0$
arbitrarily given. $A(g)$ is a special case of a "local method" defined by Nemerovsky and
Yuden (1981). By an algorithm $B(F)$ defined on $C^1(\mu, \Lambda)$ we mean a recurrence relation
that calculates $x_{k+1}$ using some of the values of $F$ and $F'$ at $x_s$, $0 \le s \le k$, with $x_0$
arbitrarily given. $B(F)$ is also an instance of a local method. We shall assume that all
global Newton methods are $B(F)$ algorithms.

Let $C_s(\mu, \Lambda)$ denote the subset of $C^1(\mu, \Lambda)$ for which $D(x)$ is self-adjoint with spectral bounds
$\mu$ and $\Lambda$ for all $x \in H$. For $F \in C_s(\mu, \Lambda)$ we can associate a "potential" function $f$ (Vainberg
1955) such that $\nabla f(x) = F(x)$ for all $x$ in $H$. (Actually, an equivalence class of functions,
differing from each other by a constant.) Also, to every $f \in C(\mu, \Lambda)$ there corresponds an
$F \in C_s(\mu, \Lambda)$. The function $f$ is weakly lower semi-continuous and the level sets of $f$ are
weakly sequentially compact. This, and the strong convexity of $f$, implies that there exists
a unique minimizer for $f$, say $z$.


When any formula below is followed by the word "steps" we mean that the formula is to
be rounded up to the nearest integer. We rely on the following claim.


THEOREM 1. NEMEROVSKY-YUDEN (1979) Given a positive $\epsilon < 1$, a fixed but arbitrary
point $x_0 \in H$ and an algorithm $A(f)$, there exists a function $f \in C(\mu, \Lambda)$ such that
if $x_k$ generated by $A(f)$ reduces $f(x_k) - f(z)$ to less than $\epsilon(f(x_0) - f(z))$ then $k$ exceeds
the number:

$$\left[\frac{c\,\min(n, \sqrt{Q})}{\ln\min(n, \sqrt{Q})}\right]\ln\frac{1}{\epsilon} = R\ln\frac{1}{\epsilon} \text{ steps}$$

Here $c$ is a positive constant, $Q > 2$, and $z = \operatorname{argmin} f$.

REMARK 2. For $n$ and $Q$ sufficiently large and $\epsilon$ sufficiently small the above bound may
be increased to:

$$c\sqrt{Q}\,\ln\frac{1}{\epsilon}$$

Given a positive $\epsilon < 1$ and any function $f \in C(\mu, \Lambda)$ there is an algorithm $A$ that yields
$f(x_k) - f(z) \le (f(x_0) - f(z))\epsilon$ whenever $k$ exceeds $4\sqrt{Q}(\ln 2)^{-1}\ln\epsilon^{-1} = R'\ln\epsilon^{-1}$. This
algorithm is due to Nesterov (1983). It is essentially optimal, and can only be improved by
a decrease in the constant factor $4/\ln 2$. Stated otherwise, the algorithm $A$ applied to any
function $f \in C(\mu, \Lambda)$ generates a sequence $x_k$ that satisfies $(f(x_k) - f(z))/(f(x_0) - f(z)) \le
(e^{-1/R'})^k$ for $1 \le k < \infty$.

The algorithm $A$ (that will be called GRAD1 below) may also be taken to be the gradient
method with step length $1/\Lambda$. This algorithm requires no information about the values
of the function $f$, while Nesterov's does. Observe that the linearly converging sequence
$(e^{-1/R})^k$, $k = 1, 2, 3, \ldots$, is for each $k$ a lower bound for the relative decrease of some
function $f$ in $C(\mu, \Lambda)$ at $x_k$, while the sequence $(e^{-1/R'})^k$ is an upper bound for the
relative decrease for any function in $C(\mu, \Lambda)$. For the gradient method above the sequence
is $(e^{-1/Q})^k$.
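
The linear rate quoted for the gradient method can be checked numerically on the simplest
member of $C(\mu, \Lambda)$, a diagonal quadratic. The sketch below is only an illustration under
that assumption: the eigenvalues, starting point and step count are hypothetical choices,
and the bound tested is the sequence $(e^{-1/Q})^k$ read from the text above.

```python
import math

def gradient_method_quadratic(eigs, x0, steps):
    """Gradient method with step 1/Lambda on f(x) = 0.5 * sum(eigs[i] * x[i]**2).

    The minimizer is z = 0 with f(z) = 0, so f(x_k) / f(x_0) is the relative decrease.
    """
    lam = max(eigs)
    f = lambda x: 0.5 * sum(l * xi * xi for l, xi in zip(eigs, x))
    x = list(x0)
    f0 = f(x)
    ratios = []
    for _ in range(steps):
        # gradient of f is (eigs[i] * x[i]); step length is 1/Lambda
        x = [xi - (l * xi) / lam for l, xi in zip(eigs, x)]
        ratios.append(f(x) / f0)
    return ratios

eigs = [1.0, 4.0, 25.0, 100.0]          # mu = 1, Lambda = 100 (hypothetical values)
Q = max(eigs) / min(eigs)
ratios = gradient_method_quadratic(eigs, [1.0, 1.0, 1.0, 1.0], 200)
within_bound = all(r <= math.exp(-(k + 1) / Q) for k, r in enumerate(ratios))
print(within_bound)
```
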

This prompts us to call the class $C(\mu, \Lambda)$ "esslinearly convergent"; that is, every function in
the class can be made to converge no slower than linearly, but
$\sup\{(f(x_k) - f(z))/(f(x_0) - f(z)) : f \in C(\mu, \Lambda)\}$, $k = 1, 2, 3, \ldots$, cannot converge
faster than linearly. For brevity we shall
refer to this latter property as "sublinearly convergent". We now observe that the class
$C_s(\mu, \Lambda)$ is also esslinearly convergent.

LEMMA 1 Given $F \in C_s(\mu, \Lambda)$, let $f$ denote any potential function for $F$. Let $z = \operatorname{argmin} f$.
The following inequalities obtain:

$$(2Q)^{-1}\left[\frac{f(x) - f(z)}{f(x_0) - f(z)}\right]^{1/2} \le \frac{\|F(x)\|}{\|F(x_0)\|} \le 2Q\left[\frac{f(x) - f(z)}{f(x_0) - f(z)}\right]^{1/2} \qquad (A)$$

Moreover, if

$$f(x) - f(z) \le \frac{\epsilon^2(f(x_0) - f(z))}{4Q^3} \quad\text{then}\quad \|F(x)\| \le \epsilon\|F(x_0)\| \qquad (B)$$


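PROOF By the strong convexity of $f$ and Taylor's theorem we get:

$$\frac{\mu}{2}\|x - z\|^2 \le f(x) - f(z) \le \frac{\Lambda}{2}\|x - z\|^2 \qquad (a)$$

By the generalized mean value theorem and the convexity of $f$ we get:

$$\|F(x)\| \le \Lambda\|x - z\| \qquad (b)$$

and

$$f(x) - f(z) \le \|F(x)\|\,\|x - z\| \qquad (c)$$

By (a) and (b) we have:

$$\frac{\|F(x)\|}{\|F(x_0)\|} \le \frac{\Lambda}{\|F(x_0)\|}\left[\frac{2(f(x) - f(z))}{\mu}\right]^{1/2} \qquad (d)$$

By (a) and (c) we get

$$\|F(x)\| \ge \frac{\mu}{2}\|x - z\| \qquad (e)$$

and

$$\|F(x)\| \ge \left[\frac{\mu(f(x) - f(z))}{2}\right]^{1/2} \qquad (f)$$

To prove (B) we find, using the hypothesis of (B) together with (d), that

$$\frac{\|F(x)\|}{\|F(x_0)\|} \le \Lambda\left[\frac{2\epsilon^2(f(x_0) - f(z))}{4\mu Q^3\|F(x_0)\|^2}\right]^{1/2}$$

Using (e) we find that the right hand side is less than or equal to

$$\Lambda\left[\frac{2\epsilon^2(f(x_0) - f(z))}{4\mu Q^3\mu^2\|x_0 - z\|^2/4}\right]^{1/2}$$

Now using (a) the above expression is less than or equal to $\epsilon$.

We now turn to the proof of (A). Using (f) and (d) we find that

$$\frac{\|F(x)\|}{\|F(x_0)\|} \ge \sqrt{\frac{\mu}{2}}\,\frac{(f(x) - f(z))^{1/2}}{\|F(x_0)\|} \ge \left[\frac{f(x) - f(z)}{f(x_0) - f(z)}\right]^{1/2}\sqrt{\frac{\mu}{2}}\,\frac{\sqrt{\mu}}{\sqrt{2}\,\Lambda}$$

This proves the left side of (A). The right hand inequality is proved similarly, using (d)
and (f).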
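
Inequality (A) can be spot-checked numerically. The sketch below assumes the simplest
self-adjoint case, a diagonal linear map $F(x) = (\lambda_i x_i)$ with potential
$f(x) = \frac{1}{2}\sum_i \lambda_i x_i^2$ and minimizer $z = 0$; the eigenvalues, the sampling
box and the number of trials are hypothetical choices. It is a sanity check on random
points, not a proof.

```python
import math
import random

def check_lemma1(eigs, trials=1000, seed=0):
    """Test (2Q)^-1 * r <= ||F(x)|| / ||F(x0)|| <= 2Q * r on random pairs (x, x0),

    where r = sqrt((f(x) - f(z)) / (f(x0) - f(z))) and z = 0, f(z) = 0.
    """
    rng = random.Random(seed)
    Q = max(eigs) / min(eigs)
    f = lambda x: 0.5 * sum(l * xi * xi for l, xi in zip(eigs, x))
    norm_F = lambda x: math.sqrt(sum((l * xi) ** 2 for l, xi in zip(eigs, x)))
    for _ in range(trials):
        x = [rng.uniform(-1.0, 1.0) for _ in eigs]
        x0 = [rng.uniform(-1.0, 1.0) for _ in eigs]
        r = math.sqrt(f(x) / f(x0))
        ratio = norm_F(x) / norm_F(x0)
        if not (r / (2.0 * Q) <= ratio <= 2.0 * Q * r):
            return False
    return True

print(check_lemma1([1.0, 3.0, 10.0]))
```
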


LEMMA 2 The class $C_s(\mu, \Lambda)$ is esslinear.

PROOF Let $f$ be a potential function corresponding to $F$. Every algorithm $B(F)$ is now
also an algorithm $A(f)$. Every function $F \in C_s(\mu, \Lambda)$ is the gradient of some $f \in C(\mu, \Lambda)$.
Hence, by THEOREM 1 and LEMMA 1, for some $F$, $\|F(x_k)\|/\|F(x_0)\|$ converges more slowly
than $(e^{-1/R})^{k/2}/2Q$. Now take for the algorithm $B$ the gradient algorithm mentioned above.
Again by LEMMA 1 every function $F$ will converge under $B$ with at least a linear rate.

Since $C_s(\mu, \Lambda)$ is a subset of $C^1(\mu, \Lambda)$, then for some $F \in C^1(\mu, \Lambda)$, $\|F(x_k)\|/\|F(x_0)\|$
converges more slowly than $(e^{-1/R})^{k/2}/2Q$. Now for $B(F)$ we take the algorithm GRAD2
below. This algorithm converges linearly. Whence we have

THEOREM 2. The class $C^1(\mu, \Lambda)$ is sublinearly convergent.

We now restrict the class $C^1(\mu, \Lambda)$ to enlarge the possibility of faster convergence. Let
$C^1(\mu, \Lambda, L)$ denote the set of maps $F \in C^1(\mu, \Lambda)$ for which $\|D(x) - D(y)\| \le L\|x - y\|$ for all $x, y \in
H$. The following well-known theorem is adjusted for our present setting.

THEOREM 3 KANTOROVICH (1948) Take $x_0 \in H$. Let $\beta(x_0) = \|D^{-1}(x_0)\|$ and
$\eta(x_0) = \|D^{-1}(x_0)F(x_0)\|$. Assume that $\|D(x) - D(y)\| \le \Lambda(x_0)\|x - y\|$ for all pairs $x, y$
in the ball $B(x_0) = \{x \in H : \|x - x_0\| \le 2\eta(x_0)\}$. If $\eta(x_0)\beta(x_0)\Lambda(x_0) = h(x_0) \le 1/2$,
then $F$ has a root $z$ such that $z$ is in the ball $B(x_0)$, the Newtonian iterates $x_j$ defined by
$x_{j+1} = x_j - D^{-1}(x_j)F(x_j)$ lie in $B(x_0)$, and $\|x_j - z\| \le 2^{1-j}(2h(x_0))^{2^j - 1}\eta(x_0)$.

A convenient terminology similar to Smale's is that under the above circumstances $x_0$ is
an "approximate root".
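
The quantities in THEOREM 3 are easy to evaluate in one dimension. The sketch below uses
a hypothetical scalar example that is not from the paper, $F(x) = x^2 - 2$ with $x_0 = 1.5$,
for which $D(x) = 2x$ has Lipschitz constant 2. It computes $\beta$, $\eta$ and $h$, runs a few
Newton steps, and compares the errors with the bound $2^{1-j}(2h)^{2^j - 1}\eta$. Only the first
few iterates are compared, since later ones are limited by floating-point rounding.

```python
import math

def kantorovich_quantities(F, D, lipschitz, x0):
    # Scalar case: norms reduce to absolute values.
    beta = 1.0 / abs(D(x0))        # ||D^{-1}(x0)||
    eta = abs(F(x0) / D(x0))       # ||D^{-1}(x0) F(x0)||
    h = eta * beta * lipschitz     # Kantorovich constant h(x0)
    return beta, eta, h

def newton_iterates(F, D, x0, steps):
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] - F(xs[-1]) / D(xs[-1]))
    return xs

F = lambda x: x * x - 2.0          # hypothetical example map
D = lambda x: 2.0 * x              # its derivative, Lipschitz with constant 2
beta, eta, h = kantorovich_quantities(F, D, lipschitz=2.0, x0=1.5)
xs = newton_iterates(F, D, 1.5, 3)
z = math.sqrt(2.0)
errors = [abs(x - z) for x in xs]
bounds = [2.0 ** (1 - j) * (2.0 * h) ** (2 ** j - 1) * eta for j in range(len(xs))]
print(h, errors, bounds)
```
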

In what follows we shall take $h(x_0) = 1/4$.

REMARK 3. The condition for an approximate root, $\eta(x_0)\beta(x_0)\Lambda(x_0) \le 1/4$, has the
equivalent form:

$$\|F(x_0)\| \le \alpha(x_0) = \frac{1}{4\beta(x_0)\Lambda(x_0)\,\bigl\|D^{-1}(x_0)F(x_0)/\|F(x_0)\|\bigr\|}$$


REMARK 4. We have for all $x \in H$ global estimates for $\eta(x)$, $\beta(x)$, and $\Lambda(x)$. Namely:
$\beta(x) \le 1/\mu$, $\eta(x) \le \|F(x)\|/\mu$ and $\Lambda(x) \le L$. From these estimates we get:

$$\alpha(x) \ge \mu^2/4L = \alpha$$

If $\|F(x)\| \le \alpha$ then $x$ is an approximate root, and

$$\eta(x) \le \mu/4L.$$


In many problems $\alpha$ is so small that the desired accuracy tolerance is achieved before an
approximate root is achieved. Thus the efficiency of a global Newton method in reaching
an approximate root is a crucial question. We now turn to our version of Smale's algorithm,
which we shall denote by "SGN". In what follows $\alpha(x_i)$ will be denoted by $\alpha_i$.

REMARK 5. In what follows the constants $\mu$, $\Lambda$, and $L$ need not be finite over the entire
space $H$, but rather on the set $S = \{x \in H : \|F(x)\| \le \|F(x_0)\|\}$.

LEMMA 3. Assume $F \in C^1(\mu, \Lambda, L)$ and $x_0$ is arbitrarily given in $H$. If $\|F(x_0)\| \le \alpha_0$ then
$x_0$ is an approximate root. If not we define a sequence $x_0, t_1, x_1, t_2, \ldots$ inductively as
follows. Given $x_i$ set

$$t_{i+1} = \frac{\|F(x_i)\| - \alpha_i}{\|F(x_i)\|} \qquad (a)$$

Choose $x_{i+1}$ to satisfy

$$\|F(x_{i+1}) - t_{i+1}F(x_i)\| \le \alpha_i/2 \qquad (b)$$

Then

$$\|F(x_{i+1})\| \le \|F(x_0)\| - (i + 1)\alpha/2$$

where $\alpha$ is defined as in Remark 4.


PROOF. We show first that $x_{i+1}$ can be chosen to satisfy (b). Let $G_i(x) = F(x) -
t_{i+1}F(x_i)$. Since $\|G_i(x_i)\| = \alpha_i$, $x_i$ is an approximate root for $G_i$, because $F' = G_i'$. A few
Newton steps (we count them below) suffice to obtain $x_{i+1}$ such that $\|G_i(x_{i+1})\| \le \alpha_i/2$.
Thus (b) can be satisfied. Using the triangle inequality on (a) together with (b) we
get that $\|F(x_{i+1})\| \le \|F(x_i)\| - \alpha_i/2$. Whence $\|F(x_{i+1})\| \le \|F(x_0)\| - \sum_{j=0}^{i}\alpha_j/2 \le
\|F(x_0)\| - (i + 1)\alpha/2$. Now choose $i$ so that $\|F(x_{i+1})\| \le \alpha$.

CLAIM 1 Let $N$ be the least integer exceeding $2(\|F(x_0)\| - \alpha)/\alpha$. Then for some $i \le N$, $x_i$
is an approximate root of $F$.

We now estimate the number of Newton steps to move from $x_i$ to $x_{i+1}$.

LEMMA 4 Let $\{y_{ij}\}$ be a sequence of Newtonian iterates starting at $y_{i0} = x_i$. Let $G_i(z_i) =
0$. Then we can choose $x_{i+1} = y_{iK}$ where $K$ is the least integer $\ge 1.443\ln(1.443\ln 8Q)$.

PROOF We have seen that $x_i$ is an approximate zero for $G_i$; hence $\|y_{ij} - z_i\| \le \mu L^{-1}(\tfrac{1}{2})^{2^j}$.

Then

$$\|G_i(y_{iK}) - G_i(z_i)\| \le \Lambda\|y_{iK} - z_i\| \le \Lambda\mu L^{-1}(\tfrac{1}{2})^{2^K}$$

Now choose $K$ so that $\Lambda\mu L^{-1}(\tfrac{1}{2})^{2^K} \le \alpha/2$, that is $(\tfrac{1}{2})^{2^K} \le 1/(8Q)$.
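
The recursion of LEMMA 3 together with the inner Newton count of LEMMA 4 can be sketched
as follows. This is a minimal scalar illustration under stated assumptions, not the
author's implementation: the map $F(x) = 2x + \sin x$ (with $\mu = 1$, $\Lambda = 3$, $L = 1$) and the
start $x_0 = 10$ are hypothetical, and the global constant $\alpha = \mu^2/4L$ of Remark 4 is
used in place of the pointwise $\alpha_i$, which is a conservative simplification since
$\alpha_i \ge \alpha$.

```python
import math

def sgn(F, D, x0, mu, lam, L, max_outer=100000):
    """Sketch of the SGN recursion for a scalar map F with derivative D."""
    alpha = mu ** 2 / (4.0 * L)                 # global alpha from Remark 4
    Q = lam / mu
    # inner Newton steps per outer step, as counted in LEMMA 4
    K = math.ceil(1.443 * math.log(1.443 * math.log(8.0 * Q)))
    x = x0
    residuals = [abs(F(x))]
    while residuals[-1] > alpha and len(residuals) <= max_outer:
        Fx = F(x)
        t = (abs(Fx) - alpha) / abs(Fx)         # step (a), with alpha for alpha_i
        target = t * Fx                         # G_i(y) = F(y) - t * F(x_i)
        y = x
        for _ in range(K):                      # Newton steps on G_i
            y = y - (F(y) - target) / D(y)
        x = y
        residuals.append(abs(F(x)))
    return x, residuals, K

F = lambda x: 2.0 * x + math.sin(x)             # hypothetical example map
D = lambda x: 2.0 + math.cos(x)                 # 1 <= D(x) <= 3, |D'(x)| <= 1
mu, lam, L = 1.0, 3.0, 1.0
alpha = mu ** 2 / (4.0 * L)
x_final, residuals, K = sgn(F, D, 10.0, mu, lam, L)
N = math.ceil(2.0 * (abs(F(10.0)) - alpha) / alpha)   # bound of CLAIM 1
print(len(residuals) - 1, N, K)
```

In this run each outer step lowers $\|F\|$ by at least $\alpha/2$, so the number of outer steps
stays below the bound $N$ of CLAIM 1.
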
REMARK 6. The above algorithm can be optimized by changing the right hand side of
inequality (b) in the recursion above to $\alpha/q$ with $q > 1$. The formula for $N$ becomes
$(\|F(x_0)\| - \alpha)q/\alpha$ and $K$ becomes $1.443\ln(1.443\ln 4qQ)$. Now choose $q$ to minimize $NK$.

LEMMA 5 Take $F \in C^1(\mu, \Lambda, L)$. Assume that $D(x)$ is self-adjoint for all $x \in H$. The
gradient method mentioned after REMARK 2, there called Algorithm $A$, which we
shall now call "GRAD1", will, starting at $x_0$, generate an approximate root in $K$ steps,
where $K$ is the smallest integer $\ge Q\ln[\|F(x_0)\|4QL/\mu^2]$.

PROOF The mapping $G$ defined by $G(y) = y - F(y)/\Lambda$ has a fixed point $z$ satisfying $F(z)
= 0$. It is a contractor satisfying a Lipschitz condition with constant $q = 1 - 1/Q$ (Goldstein (1967), pp.
15 and 24). Set $G(x_n) = x_{n+1} = x_n - F(x_n)/\Lambda$. Then $\|F(x_n) - F(z)\| \le \Lambda\|x_n - z\| \le
\Lambda q^n\|x_0 - x_1\|/(1 - q) = \|F(x_0)\|q^n/(1 - q)$. Now choose $n$ so that

$$\|F(x_0)\|q^n \le \alpha(1 - q)$$

Using the inequality $-\ln(1 - \mu/\Lambda) \ge \mu/\Lambda$, one obtains the lemma.
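
The iteration $x_{n+1} = x_n - F(x_n)/\Lambda$ from the proof can be run directly and its step
count compared with the bound $Q\ln[\|F(x_0)\|4QL/\mu^2]$. The sketch below reuses a
hypothetical scalar map, $F(x) = 2x + \sin x$ with $\mu = 1$, $\Lambda = 3$, $L = 1$ and $x_0 = 10$;
these values are assumptions made for illustration, and a scalar map is trivially self-adjoint.

```python
import math

def grad1(F, x0, mu, lam, L, max_steps=10 ** 6):
    """Iterate x <- x - F(x)/Lambda until ||F(x)|| <= alpha = mu^2 / (4L)."""
    alpha = mu ** 2 / (4.0 * L)
    x, steps = x0, 0
    while abs(F(x)) > alpha and steps < max_steps:
        x = x - F(x) / lam
        steps += 1
    return x, steps

F = lambda x: 2.0 * x + math.sin(x)     # hypothetical example map
mu, lam, L = 1.0, 3.0, 1.0
Q = lam / mu
x_final, steps = grad1(F, 10.0, mu, lam, L)
# step bound of LEMMA 5, rounded up to an integer
bound = math.ceil(Q * math.log(abs(F(10.0)) * 4.0 * Q * L / mu ** 2))
print(steps, bound)
```
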

We  now  consider  a  gradient  method  for  the  non-symmetric  case.  We  call  this  gradient 
method  GRAD2. 

ALGORITHM GRAD2 Take $F \in C^1(\mu, \Lambda, L)$, $x_0 \in S$ and set $f(x) = \|F(x)\|^2$. Then $\nabla f(x)$
satisfies $\|\nabla f(x) - \nabla f(y)\| \le M\|x - y\|$ for all $x$ and $y$ in $S$, with $M = 2(\Lambda^2 + \|F(x_0)\|L)$.
Set $\phi(x) = f(x)\nabla f(x)/\|\nabla f(x)\|^2$. Given arbitrary $x_0$ in $H$ set $x_{k+1} = x_k - \gamma_0\phi(x_k)$ with
$\gamma_0 = 2\mu^2/M$. If $k$ exceeds

$$2(S + Q^2)\ln 4S, \qquad S = \|F(x_0)\|L/\mu^2,$$

then $x_k$ is an approximate root.

PROOF Adding and subtracting $(F'(y))^*F(x)$ we find that $\|\nabla f(x) - \nabla f(y)\| \le 2(\Lambda^2 +
\|F(x)\|L)\|x - y\|$, and $f(x) - f(x - \gamma\phi(x)) = \gamma[\nabla f(x), \phi(x)] + \gamma[\nabla f(\xi) - \nabla f(x), \phi(x)]$.
Here $\|\xi - x\| \le \gamma\|\phi(x)\|$. Then $f(x - \gamma\phi(x)) \le f(x) - \gamma f(x) + \gamma^2 M\|\phi(x)\|^2$. Moreover $\nabla f(x) =
2(F'(x))^*F(x)$, and $\|\nabla f(x)\| \ge 2\|F(x)\|\mu$. Then $f(x_{k+1}) \le f(x_k)[1 - \gamma_0 + \gamma_0^2 M/4\mu^2] =
f(x_k)(1 - \mu^2/M)$. Taking square roots we get that $\|F(x_{k+1})\| \le \|F(x_k)\|(1 - \mu^2/2M)$.
Finally, choose $k$ so that $\|F(x_0)\|(1 - \mu^2/2M)^k \le \mu^2/4L$.


Comparing the algorithms. Let

$$S = \|F(x_0)\|L/\mu^2.$$

By CLAIM 1 and LEMMA 4 the total number of steps of SGN is:

SGN: $11.544(S - .25)\ln(1.443\ln 8Q)$

GRAD2: $2(S + Q^2)\ln 4S$

GRAD1: $Q\ln(4SQ)$


Notice that, unlike GRAD1 and GRAD2, SGN is insensitive to the condition number $Q$.
However, SGN is sensitive to $S$. GRAD1 is sensitive to $Q$ but not to $S$. GRAD2 is sensitive
to both of these factors. In the symmetric case for fixed $Q$, GRAD1 is quicker than SGN
when $\|F(x_0)\|$ grows sufficiently large or if $L/\mu^2$ gets sufficiently large. On the other
hand, for fixed $S$, SGN is quicker as $Q$ grows sufficiently large. In the non-symmetric case
SGN is superior to GRAD2 with respect to the number of steps. When the cost per
step is included, the gradient methods become cheaper in the $n$-dimensional case when $n$
is sufficiently large. For each Newton step an $n \times n$ system of linear equations is solved,
costing $O(n^3)$ multiplications, while the corresponding GRAD2 step involves a
multiplication of an $n \times n$ matrix and an $n \times 1$ matrix, or $n^2$ multiplications, and GRAD1 requires no
matrix operations.
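
The three step-count estimates can be tabulated to see the sensitivities described above.
The sketch below simply evaluates the formulas as stated, without rounding up to integers;
the sample values of $S$ and $Q$ are hypothetical.

```python
import math

def sgn_steps(S, Q):
    # SGN: 11.544 (S - .25) ln(1.443 ln 8Q)
    return 11.544 * (S - 0.25) * math.log(1.443 * math.log(8.0 * Q))

def grad2_steps(S, Q):
    # GRAD2: 2 (S + Q^2) ln 4S
    return 2.0 * (S + Q ** 2) * math.log(4.0 * S)

def grad1_steps(S, Q):
    # GRAD1: Q ln(4SQ)
    return Q * math.log(4.0 * S * Q)

for S, Q in [(10.0, 10.0), (10.0, 1.0e6), (1.0e4, 10.0)]:
    print(S, Q, sgn_steps(S, Q), grad2_steps(S, Q), grad1_steps(S, Q))
```

With $S = 10$ fixed, raising $Q$ from $10$ to $10^6$ changes the SGN estimate by less than a
factor of two, while the GRAD1 estimate grows by several orders of magnitude; with $Q = 10$
fixed and $S = 10^4$, the GRAD1 estimate is the smaller one.
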